Author: Vinícius Alves
Date: 03/31/2023
Version: 1.0
Bellabeat is a high-tech company that manufactures health-focused smart products. The company is focused on women and develops gadgets that collect data on activity, sleep, stress, and reproductive health. This has allowed Bellabeat to empower women with knowledge about their own health and habits.
Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.
Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.
Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.
Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.
Bellabeat membership: Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.
To help Bellabeat improve a product, a study was commissioned on user behavior, using health data collected by another company's gadgets.
The questions raised in this analysis were:
The data used in this analysis comes from Fitbit gadgets and is available on Kaggle.
This dataset was collected by a distributed survey via Amazon Mechanical Turk between March 12, 2016 and May 12, 2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. The data does not have any personal information of the users.
This data is open and has the CC0: Public Domain license, which means anyone can copy, modify, distribute, and work with it, even for commercial purposes, without asking permission.
The dataset is composed of 18 .csv files: 15 in long format and 3 in wide format.
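The difference between the two formats can be illustrated with a small reshape. This is a minimal sketch assuming a wide file follows the pattern of minuteCaloriesWide_merged.csv (one column per minute of the hour, Calories00 through Calories59); the 'ActivityHour' column name is an assumption for illustration:

```python
import pandas as pd

# Toy example of one "wide" row: one column per minute of the hour.
# Column names follow minuteCaloriesWide_merged.csv; 'ActivityHour' is assumed.
wide = pd.DataFrame({
    'Id': [1503960366],
    'ActivityHour': ['4/13/2016 12:00:00 AM'],
    'Calories00': [0.9357],
    'Calories01': [1.2176],
})

# Melt to the "narrow"/long format: one row per (Id, hour, minute)
long_df = wide.melt(
    id_vars=['Id', 'ActivityHour'],
    var_name='Minute',
    value_name='Calories',
)
# Keep only the numeric minute offset ('Calories01' -> 1)
long_df['Minute'] = long_df['Minute'].str[len('Calories'):].astype(int)
print(long_df)
```

The narrow files in the dataset already store one row per minute, so both formats carry the same information.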
Regarding bias in the data, I think that some points need to be discussed:
Most of the shared datasets contain data from 33 people, but 33 people is just a small slice of the entire population that uses smart gadgets. Some archives contain information from only 24 or even 8 people. Because of that, we cannot say that our data is unbiased.
17 of the 18 archives have no missing information, but in 'weightLogInfo_merged.csv' we only have Fat information for two people. Thus, that information is not supposed to be used.
Most of the datasets have readable, understandable variables, but 'minuteSleep_merged.csv' has a variable called 'value' whose meaning is not clear. Maybe it is the minute slept, in which case the value 1 makes sense, but at some points it has a value of 2 or 3.
After checking the Fitabase Data Dictionary, these values can be translated as: 1 = asleep, 2 = restless, 3 = awake.
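Decoding the column with that dictionary is straightforward; a small sketch, using the 'value' column name as it appears in minuteSleep_merged.csv:

```python
import pandas as pd

# Mapping from the Fitabase Data Dictionary: 1 = asleep, 2 = restless, 3 = awake
sleep_states = {1: 'asleep', 2: 'restless', 3: 'awake'}

# A few sample rows standing in for minuteSleep_merged.csv's 'value' column
sample = pd.DataFrame({'value': [1, 1, 2, 3, 1]})
sample['state'] = sample['value'].map(sleep_states)
print(sample['state'].value_counts())
```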
As this dataset was built with the consent of a few people who used these gadgets, it is already biased: they knew their data was being collected. So we have a bias in the type of people who used these gadgets and agreed to share their data, and possibly also a change in behavior during the period in which the data was collected.
# Libraries
import pandas as pd
import numpy as np
import statistics as st
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import os
The data source archives used in this study are listed below, along with some initial data analysis.
# Getting the archives
import fnmatch

extensions = ['*.csv']
archives = []
folder_path = "Data_Coursera_CaseStudy02/"

# Walk through the folder and its subdirectories to find CSV files
for root, dirs, files in os.walk(folder_path):
    for csv_extension in extensions:
        for filename in fnmatch.filter(files, csv_extension):
            csv_path = os.path.join(root, filename)
            archives.append(csv_path)

archives
['Data_Coursera_CaseStudy02/dailyActivity_merged.csv',
 'Data_Coursera_CaseStudy02/dailyCalories_merged.csv',
 'Data_Coursera_CaseStudy02/dailyIntensities_merged.csv',
 'Data_Coursera_CaseStudy02/dailySteps_merged.csv',
 'Data_Coursera_CaseStudy02/heartrate_seconds_merged.csv',
 'Data_Coursera_CaseStudy02/hourlyCalories_merged.csv',
 'Data_Coursera_CaseStudy02/hourlyIntensities_merged.csv',
 'Data_Coursera_CaseStudy02/hourlySteps_merged.csv',
 'Data_Coursera_CaseStudy02/minuteCaloriesNarrow_merged.csv',
 'Data_Coursera_CaseStudy02/minuteCaloriesWide_merged.csv',
 'Data_Coursera_CaseStudy02/minuteIntensitiesNarrow_merged.csv',
 'Data_Coursera_CaseStudy02/minuteIntensitiesWide_merged.csv',
 'Data_Coursera_CaseStudy02/minuteMETsNarrow_merged.csv',
 'Data_Coursera_CaseStudy02/minuteSleep_merged.csv',
 'Data_Coursera_CaseStudy02/minuteStepsNarrow_merged.csv',
 'Data_Coursera_CaseStudy02/minuteStepsWide_merged.csv',
 'Data_Coursera_CaseStudy02/sleepDay_merged.csv',
 'Data_Coursera_CaseStudy02/weightLogInfo_merged.csv']
# Checking the data for bias:
# 1 - Checking the sample size
# For every dataset:
#     Check the number of Ids (people)
#     Check the total line count
#     Check the missing values
for archive in archives:
    new_df = pd.read_csv(archive)
    print(f"archive: { archive.replace('Data_Coursera_CaseStudy02/',' ') }")
    print(f'Ids: {len(new_df.Id.unique())}')
    print(f'Total line count: {len(new_df)}')
    print(f'Missing values: {new_df.isnull().any(axis=1).sum()}')
    print("-------------------------------------------")
archive                               Ids   Total line count   Missing values
dailyActivity_merged.csv              33    863                0
dailyCalories_merged.csv              33    940                0
dailyIntensities_merged.csv           33    940                0
dailySteps_merged.csv                 33    863                0
heartrate_seconds_merged.csv          14    2483658            0
hourlyCalories_merged.csv             33    22099              0
hourlyIntensities_merged.csv          33    22099              0
hourlySteps_merged.csv                33    22099              0
minuteCaloriesNarrow_merged.csv       33    1325580            0
minuteCaloriesWide_merged.csv         33    21645              0
minuteIntensitiesNarrow_merged.csv    33    1325580            0
minuteIntensitiesWide_merged.csv      33    21645              0
minuteMETsNarrow_merged.csv           33    1325573            0
minuteSleep_merged.csv                24    187978             0
minuteStepsNarrow_merged.csv          33    1325580            0
minuteStepsWide_merged.csv            33    21645              0
sleepDay_merged.csv                   24    410                0
weightLogInfo_merged.csv              8     67                 0
As you can already see, I am using a Jupyter Notebook to perform this analysis.
First of all, I downloaded the data into a folder called 'Data_Coursera_CaseStudy02', so I will read every file from there.
To ensure that the data is clean, I will perform these steps on every archive:
Is there any duplicated value in the files?
for archive in archives:
    new_df = pd.read_csv(archive)
    print(f"archive: { archive.replace('Data_Coursera_CaseStudy02/',' ') }")
    print(f'Duplicated lines: {new_df.duplicated().sum()}')
    print("-------------------------------------------")
archive                               Duplicated lines
dailyActivity_merged.csv              0
dailyCalories_merged.csv              0
dailyIntensities_merged.csv           0
dailySteps_merged.csv                 0
heartrate_seconds_merged.csv          0
hourlyCalories_merged.csv             0
hourlyIntensities_merged.csv          0
hourlySteps_merged.csv                0
minuteCaloriesNarrow_merged.csv       0
minuteCaloriesWide_merged.csv         0
minuteIntensitiesNarrow_merged.csv    0
minuteIntensitiesWide_merged.csv      0
minuteMETsNarrow_merged.csv           0
minuteSleep_merged.csv                0
minuteStepsNarrow_merged.csv          0
minuteStepsWide_merged.csv            0
sleepDay_merged.csv                   0
weightLogInfo_merged.csv              0
On the first run, duplicated rows were found in two of the datasets (the output above shows zeros because the cell was re-run after the removal step below):
archive: minuteSleep_merged.csv
Duplicated lines: 543
archive: sleepDay_merged.csv
Duplicated lines: 3
Removing duplicates
# Removing duplicates
for archive in archives:
    new_df = pd.read_csv(archive)
    if new_df.duplicated().sum() > 0:
        print(f"Removing the duplicates from: {archive.replace('Data_Coursera_CaseStudy02/',' ')}")
        new_df.drop_duplicates(keep='first', inplace=True)
        # Saving the DataFrame without the duplicated values.
        # index=False prevents pandas from writing the index as an extra
        # 'Unnamed: 0' column on every save.
        new_df.to_csv(archive, index=False)

# Removing the duplicates from: minuteSleep_merged.csv
# Removing the duplicates from: sleepDay_merged.csv
# Searching for rows or columns with no data:
for archive in archives:
    new_df = pd.read_csv(archive)
    print(f"archive: { archive.replace('Data_Coursera_CaseStudy02/',' ') }")
    print(f'Total line count: {len(new_df)}')
    print(f'Missing values: {new_df.isnull().any(axis=1).sum()}')
    print("-------------------------------------------")
archive                               Total line count   Missing values
dailyActivity_merged.csv              863                0
dailyCalories_merged.csv              940                0
dailyIntensities_merged.csv           940                0
dailySteps_merged.csv                 863                0
heartrate_seconds_merged.csv          2483658            0
hourlyCalories_merged.csv             22099              0
hourlyIntensities_merged.csv          22099              0
hourlySteps_merged.csv                22099              0
minuteCaloriesNarrow_merged.csv       1325580            0
minuteCaloriesWide_merged.csv         21645              0
minuteIntensitiesNarrow_merged.csv    1325580            0
minuteIntensitiesWide_merged.csv      21645              0
minuteMETsNarrow_merged.csv           1325573            0
minuteSleep_merged.csv                187978             0
minuteStepsNarrow_merged.csv          1325580            0
minuteStepsWide_merged.csv            21645              0
sleepDay_merged.csv                   410                0
weightLogInfo_merged.csv              67                 0
Missing values detected in:
archive: weightLogInfo_merged.csv
Total line count: 67
Missing values: 65
This looked strange at first.
After checking the data, the missing values come from a column with only 2 recorded values.
That column needs to be removed, because we will not use that data.
Describing every archive:
for archive in archives:
    new_df = pd.read_csv(archive)
    print(f"archive: { archive.replace('Data_Coursera_CaseStudy02/',' ') }")
    print('Description:')
    print(new_df.describe())
    print("-------------------------------------------")
archive: dailyActivity_merged.csv
Description:
Unnamed: 0.1 Unnamed: 0 Id TotalSteps TotalDistance \
count 863.000000 863.000000 8.630000e+02 863.000000 863.000000
mean 431.000000 470.404403 4.857542e+09 8319.392816 5.979513
std 249.270937 270.542606 2.418405e+09 4744.967224 3.721044
min 0.000000 0.000000 1.503960e+09 4.000000 0.000000
25% 215.500000 240.500000 2.320127e+09 4923.000000 3.370000
50% 431.000000 471.000000 4.445115e+09 8053.000000 5.590000
75% 646.500000 708.500000 6.962181e+09 11092.500000 7.900000
max 862.000000 939.000000 8.877689e+09 36019.000000 28.030001
TrackerDistance LoggedActivitiesDistance VeryActiveDistance \
count 863.000000 863.000000 863.000000
mean 5.963882 0.117822 1.636756
std 3.703191 0.646111 2.735289
min 0.000000 0.000000 0.000000
25% 3.370000 0.000000 0.000000
50% 5.590000 0.000000 0.410000
75% 7.880000 0.000000 2.275000
max 28.030001 4.942142 21.920000
ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance \
count 863.000000 863.000000 863.000000
mean 0.618181 3.638899 0.001750
std 0.905049 1.857503 0.007651
min 0.000000 0.000000 0.000000
25% 0.000000 2.345000 0.000000
50% 0.310000 3.580000 0.000000
75% 0.865000 4.895000 0.000000
max 6.480000 10.710000 0.110000
VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes \
count 863.000000 863.000000 863.000000
mean 23.015064 14.775203 210.016222
std 33.646118 20.427405 96.781296
min 0.000000 0.000000 0.000000
25% 0.000000 0.000000 146.500000
50% 7.000000 8.000000 208.000000
75% 35.000000 21.000000 272.000000
max 210.000000 143.000000 518.000000
SedentaryMinutes Calories
count 863.000000 863.000000
mean 955.753187 2361.295481
std 280.293359 702.711148
min 0.000000 52.000000
25% 721.500000 1855.500000
50% 1021.000000 2220.000000
75% 1189.000000 2832.000000
max 1440.000000 4900.000000
-------------------------------------------
archive: dailyCalories_merged.csv
Description:
Id Calories
count 9.400000e+02 940.000000
mean 4.855407e+09 2303.609574
std 2.424805e+09 718.166862
min 1.503960e+09 0.000000
25% 2.320127e+09 1828.500000
50% 4.445115e+09 2134.000000
75% 6.962181e+09 2793.250000
max 8.877689e+09 4900.000000
-------------------------------------------
archive: dailyIntensities_merged.csv
Description:
Id SedentaryMinutes LightlyActiveMinutes \
count 9.400000e+02 940.000000 940.000000
mean 4.855407e+09 991.210638 192.812766
std 2.424805e+09 301.267437 109.174700
min 1.503960e+09 0.000000 0.000000
25% 2.320127e+09 729.750000 127.000000
50% 4.445115e+09 1057.500000 199.000000
75% 6.962181e+09 1229.500000 264.000000
max 8.877689e+09 1440.000000 518.000000
FairlyActiveMinutes VeryActiveMinutes SedentaryActiveDistance \
count 940.000000 940.000000 940.000000
mean 13.564894 21.164894 0.001606
std 19.987404 32.844803 0.007346
min 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000
50% 6.000000 4.000000 0.000000
75% 19.000000 32.000000 0.000000
max 143.000000 210.000000 0.110000
LightActiveDistance ModeratelyActiveDistance VeryActiveDistance
count 940.000000 940.000000 940.000000
mean 3.340819 0.567543 1.502681
std 2.040655 0.883580 2.658941
min 0.000000 0.000000 0.000000
25% 1.945000 0.000000 0.000000
50% 3.365000 0.240000 0.210000
75% 4.782500 0.800000 2.052500
max 10.710000 6.480000 21.920000
-------------------------------------------
archive: dailySteps_merged.csv
Description:
Unnamed: 0.1 Unnamed: 0 Id StepTotal
count 863.000000 863.000000 8.630000e+02 863.000000
mean 431.000000 470.404403 4.857542e+09 8319.392816
std 249.270937 270.542606 2.418405e+09 4744.967224
min 0.000000 0.000000 1.503960e+09 4.000000
25% 215.500000 240.500000 2.320127e+09 4923.000000
50% 431.000000 471.000000 4.445115e+09 8053.000000
75% 646.500000 708.500000 6.962181e+09 11092.500000
max 862.000000 939.000000 8.877689e+09 36019.000000
-------------------------------------------
archive: heartrate_seconds_merged.csv
Description:
Id Value
count 2.483658e+06 2.483658e+06
mean 5.513765e+09 7.732842e+01
std 1.950224e+09 1.940450e+01
min 2.022484e+09 3.600000e+01
25% 4.388162e+09 6.300000e+01
50% 5.553957e+09 7.300000e+01
75% 6.962181e+09 8.800000e+01
max 8.877689e+09 2.030000e+02
-------------------------------------------
archive: hourlyCalories_merged.csv
Description:
Id Calories
count 2.209900e+04 22099.000000
mean 4.848235e+09 97.386760
std 2.422500e+09 60.702622
min 1.503960e+09 42.000000
25% 2.320127e+09 63.000000
50% 4.445115e+09 83.000000
75% 6.962181e+09 108.000000
max 8.877689e+09 948.000000
-------------------------------------------
archive: hourlyIntensities_merged.csv
Description:
Id TotalIntensity AverageIntensity
count 2.209900e+04 22099.000000 22099.000000
mean 4.848235e+09 12.035341 0.200589
std 2.422500e+09 21.133110 0.352219
min 1.503960e+09 0.000000 0.000000
25% 2.320127e+09 0.000000 0.000000
50% 4.445115e+09 3.000000 0.050000
75% 6.962181e+09 16.000000 0.266667
max 8.877689e+09 180.000000 3.000000
-------------------------------------------
archive: hourlySteps_merged.csv
Description:
Id StepTotal
count 2.209900e+04 22099.000000
mean 4.848235e+09 320.166342
std 2.422500e+09 690.384228
min 1.503960e+09 0.000000
25% 2.320127e+09 0.000000
50% 4.445115e+09 40.000000
75% 6.962181e+09 357.000000
max 8.877689e+09 10554.000000
-------------------------------------------
archive: minuteCaloriesNarrow_merged.csv
Description:
Id Calories
count 1.325580e+06 1.325580e+06
mean 4.847898e+09 1.623130e+00
std 2.422313e+09 1.410447e+00
min 1.503960e+09 0.000000e+00
25% 2.320127e+09 9.357000e-01
50% 4.445115e+09 1.217600e+00
75% 6.962181e+09 1.432700e+00
max 8.877689e+09 1.974995e+01
-------------------------------------------
archive: minuteCaloriesWide_merged.csv
Description:
Id Calories00 Calories01 Calories02 Calories03 \
count 2.164500e+04 21645.000000 21645.000000 21645.000000 21645.000000
mean 4.836965e+09 1.622629 1.626377 1.637824 1.635515
std 2.424088e+09 1.398418 1.395083 1.408828 1.419590
min 1.503960e+09 0.702700 0.702700 0.702700 0.702700
25% 2.320127e+09 0.935700 0.935700 0.937680 0.935700
50% 4.445115e+09 1.217600 1.217600 1.220400 1.218500
75% 6.962181e+09 1.432700 1.432700 1.432700 1.432700
max 8.877689e+09 19.727337 19.727337 19.727337 19.727337
Calories04 Calories05 Calories06 Calories07 Calories08 \
count 21645.000000 21645.000000 21645.000000 21645.000000 21645.000000
mean 1.637997 1.638306 1.639910 1.629520 1.623686
std 1.433532 1.438253 1.435465 1.424092 1.411596
min 0.702700 0.702700 0.702700 0.702700 0.702700
25% 0.935700 0.935700 0.935700 0.935700 0.935700
50% 1.218500 1.218500 1.218500 1.217600 1.217600
75% 1.432700 1.432700 1.432700 1.432700 1.432700
max 19.727337 19.727337 19.727337 19.727337 19.727337
... Calories50 Calories51 Calories52 Calories53 \
count ... 21645.000000 21645.000000 21645.000000 21645.000000
mean ... 1.623665 1.613643 1.620958 1.618227
std ... 1.407171 1.395206 1.407914 1.400498
min ... 0.702700 0.702700 0.702700 0.702700
25% ... 0.935700 0.935700 0.935700 0.935700
50% ... 1.217600 1.217600 1.217600 1.217600
75% ... 1.432700 1.432700 1.432700 1.432700
max ... 19.749947 19.749947 19.749947 19.749947
Calories54 Calories55 Calories56 Calories57 Calories58 \
count 21645.000000 21645.000000 21645.000000 21645.000000 21645.000000
mean 1.621229 1.615972 1.608714 1.612657 1.611715
std 1.408974 1.392530 1.376827 1.369097 1.374954
min 0.702700 0.702700 0.702700 0.702700 0.702700
25% 0.935700 0.935700 0.935700 0.935700 0.935700
50% 1.217600 1.217600 1.217600 1.217600 1.217600
75% 1.432700 1.432700 1.432700 1.432700 1.432700
max 19.749947 19.749947 19.727337 19.727337 19.727337
Calories59
count 21645.000000
mean 1.612110
std 1.373888
min 0.000000
25% 0.935700
50% 1.217600
75% 1.432700
max 19.727337
[8 rows x 61 columns]
-------------------------------------------
archive: minuteIntensitiesNarrow_merged.csv
Description:
Id Intensity
count 1.325580e+06 1.325580e+06
mean 4.847898e+09 2.005937e-01
std 2.422313e+09 5.190227e-01
min 1.503960e+09 0.000000e+00
25% 2.320127e+09 0.000000e+00
50% 4.445115e+09 0.000000e+00
75% 6.962181e+09 0.000000e+00
max 8.877689e+09 3.000000e+00
-------------------------------------------
archive: minuteIntensitiesWide_merged.csv
Description:
Id Intensity00 Intensity01 Intensity02 Intensity03 \
count 2.164500e+04 21645.000000 21645.000000 21645.000000 21645.000000
mean 4.836965e+09 0.199723 0.203326 0.208177 0.203835
std 2.424088e+09 0.509819 0.515432 0.521394 0.518137
min 1.503960e+09 0.000000 0.000000 0.000000 0.000000
25% 2.320127e+09 0.000000 0.000000 0.000000 0.000000
50% 4.445115e+09 0.000000 0.000000 0.000000 0.000000
75% 6.962181e+09 0.000000 0.000000 0.000000 0.000000
max 8.877689e+09 3.000000 3.000000 3.000000 3.000000
Intensity04 Intensity05 Intensity06 Intensity07 Intensity08 \
count 21645.000000 21645.000000 21645.000000 21645.000000 21645.000000
mean 0.205082 0.204897 0.206560 0.201894 0.202310
std 0.521956 0.521054 0.523053 0.519074 0.522594
min 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000
75% 0.000000 0.000000 0.000000 0.000000 0.000000
max 3.000000 3.000000 3.000000 3.000000 3.000000
... Intensity50 Intensity51 Intensity52 Intensity53 \
count ... 21645.000000 21645.000000 21645.000000 21645.000000
mean ... 0.201016 0.195796 0.198337 0.199399
std ... 0.514814 0.510299 0.511264 0.513331
min ... 0.000000 0.000000 0.000000 0.000000
25% ... 0.000000 0.000000 0.000000 0.000000
50% ... 0.000000 0.000000 0.000000 0.000000
75% ... 0.000000 0.000000 0.000000 0.000000
max ... 3.000000 3.000000 3.000000 3.000000
Intensity54 Intensity55 Intensity56 Intensity57 Intensity58 \
count 21645.000000 21645.000000 21645.000000 21645.000000 21645.000000
mean 0.200139 0.198753 0.195565 0.199122 0.198244
std 0.512142 0.511238 0.506435 0.511907 0.510124
min 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000
75% 0.000000 0.000000 0.000000 0.000000 0.000000
max 3.000000 3.000000 3.000000 3.000000 3.000000
Intensity59
count 21645.000000
mean 0.195426
std 0.503423
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 3.000000
[8 rows x 61 columns]
-------------------------------------------
archive: minuteMETsNarrow_merged.csv
Description:
Unnamed: 0.1 Unnamed: 0 Id METs
count 1.325573e+06 1.325573e+06 1.325573e+06 1.325573e+06
mean 6.627860e+05 6.627889e+05 4.847895e+09 1.469009e+01
std 3.826601e+05 3.826619e+05 2.422313e+09 1.205539e+01
min 0.000000e+00 0.000000e+00 1.503960e+09 6.000000e+00
25% 3.313930e+05 3.313950e+05 2.320127e+09 1.000000e+01
50% 6.627860e+05 6.627880e+05 4.445115e+09 1.000000e+01
75% 9.941790e+05 9.941820e+05 6.962181e+09 1.100000e+01
max 1.325572e+06 1.325579e+06 8.877689e+09 1.570000e+02
-------------------------------------------
archive: minuteSleep_merged.csv
Description:
Unnamed: 0 Id value logId
count 187978.000000 1.879780e+05 187978.000000 1.879780e+05
mean 94243.300796 4.997443e+09 1.095937 1.149589e+10
std 54499.126136 2.069872e+09 0.328912 6.820112e+07
min 0.000000 1.503960e+09 1.000000 1.137223e+10
25% 46994.250000 3.977334e+09 1.000000 1.143931e+10
50% 93988.500000 4.702922e+09 1.000000 1.150114e+10
75% 141525.750000 6.962181e+09 1.000000 1.155221e+10
max 188520.000000 8.792010e+09 3.000000 1.161625e+10
-------------------------------------------
archive: minuteStepsNarrow_merged.csv
Description:
Id Steps
count 1.325580e+06 1.325580e+06
mean 4.847898e+09 5.336192e+00
std 2.422313e+09 1.812830e+01
min 1.503960e+09 0.000000e+00
25% 2.320127e+09 0.000000e+00
50% 4.445115e+09 0.000000e+00
75% 6.962181e+09 0.000000e+00
max 8.877689e+09 2.200000e+02
-------------------------------------------
archive: minuteStepsWide_merged.csv
Description:
Id Steps00 Steps01 Steps02 Steps03 \
count 2.164500e+04 21645.000000 21645.000000 21645.000000 21645.000000
mean 4.836965e+09 5.304366 5.335412 5.531439 5.469439
std 2.424088e+09 17.783331 17.678358 18.079791 18.106414
min 1.503960e+09 0.000000 0.000000 0.000000 0.000000
25% 2.320127e+09 0.000000 0.000000 0.000000 0.000000
50% 4.445115e+09 0.000000 0.000000 0.000000 0.000000
75% 6.962181e+09 0.000000 0.000000 0.000000 0.000000
max 8.877689e+09 186.000000 180.000000 182.000000 182.000000
Steps04 Steps05 Steps06 Steps07 Steps08 \
count 21645.000000 21645.000000 21645.000000 21645.000000 21645.00000
mean 5.461862 5.590252 5.559483 5.412474 5.35879
std 18.288469 18.565165 18.484912 18.335665 18.20523
min 0.000000 0.000000 0.000000 0.000000 0.00000
25% 0.000000 0.000000 0.000000 0.000000 0.00000
50% 0.000000 0.000000 0.000000 0.000000 0.00000
75% 0.000000 0.000000 0.000000 0.000000 0.00000
max 181.000000 180.000000 181.000000 183.000000 180.00000
... Steps50 Steps51 Steps52 Steps53 \
count ... 21645.000000 21645.000000 21645.000000 21645.000000
mean ... 5.329175 5.194456 5.225595 5.145484
std ... 17.870527 17.601857 17.618497 17.570195
min ... 0.000000 0.000000 0.000000 0.000000
25% ... 0.000000 0.000000 0.000000 0.000000
50% ... 0.000000 0.000000 0.000000 0.000000
75% ... 0.000000 0.000000 0.000000 0.000000
max ... 182.000000 181.000000 181.000000 181.000000
Steps54 Steps55 Steps56 Steps57 Steps58 \
count 21645.000000 21645.000000 21645.000000 21645.000000 21645.000000
mean 5.223654 5.281220 5.179533 5.251836 5.143636
std 17.684634 17.828413 17.569268 17.686583 17.427494
min 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000
75% 0.000000 0.000000 0.000000 0.000000 0.000000
max 184.000000 181.000000 182.000000 182.000000 180.000000
Steps59
count 21645.000000
mean 5.288935
std 17.721454
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 189.000000
[8 rows x 61 columns]
-------------------------------------------
archive: sleepDay_merged.csv
Description:
Unnamed: 0 Id TotalSleepRecords TotalMinutesAsleep \
count 410.000000 4.100000e+02 410.000000 410.000000
mean 205.643902 4.994963e+09 1.119512 419.173171
std 119.470511 2.060863e+09 0.346636 118.635918
min 0.000000 1.503960e+09 1.000000 58.000000
25% 102.250000 3.977334e+09 1.000000 361.000000
50% 205.500000 4.702922e+09 1.000000 432.500000
75% 308.750000 6.962181e+09 1.000000 490.000000
max 412.000000 8.792010e+09 3.000000 796.000000
TotalTimeInBed
count 410.000000
mean 458.482927
std 127.455140
min 61.000000
25% 403.750000
50% 463.000000
75% 526.000000
max 961.000000
-------------------------------------------
archive: weightLogInfo_merged.csv
Description:
Unnamed: 0 Id WeightKg WeightPounds BMI \
count 67.000000 6.700000e+01 67.000000 67.000000 67.000000
mean 33.000000 7.009282e+09 72.035821 158.811801 25.185224
std 19.485037 1.950322e+09 13.923206 30.695415 3.066963
min 0.000000 1.503960e+09 52.599998 115.963147 21.450001
25% 16.500000 6.962181e+09 61.400002 135.363832 23.959999
50% 33.000000 6.962181e+09 62.500000 137.788914 24.389999
75% 49.500000 8.877689e+09 85.049999 187.503152 25.559999
max 66.000000 8.877689e+09 133.500000 294.317120 47.540001
LogId
count 6.700000e+01
mean 1.461772e+12
std 7.829948e+08
min 1.460444e+12
25% 1.461079e+12
50% 1.461802e+12
75% 1.462375e+12
max 1.463098e+12
-------------------------------------------
From the archive: dailyCalories_merged.csv
We can see a minimum of 0 calories burned in a day, which is impossible: even at rest a person burns calories. These rows need to be removed from the dataset.
From the archive: dailyIntensities_merged.csv
The sum of the columns SedentaryMinutes + LightlyActiveMinutes + FairlyActiveMinutes + VeryActiveMinutes should be 1440 (the total minutes in a day).
On some days the total is not 1440; maybe the gadget's battery ran out, or the data was altered.
Counting the rows that do not sum to 1440 leaves 462 of them (49.1% of the data), so I will continue with this data.
From the archive: dailyActivity_merged.csv
All the days with 0 steps will be excluded; on those days the volunteers probably did not wear the gadget. There are 77 rows with a TotalSteps of 0.
From the archive: dailySteps_merged.csv
The same situation as the archive above: the rows with 0 steps were deleted.
From the archive: hourlySteps_merged.csv
Here, a 0-step value for a single hour is meaningful, but a day whose hours all sum to 0 is not. Days that sum to a total of 0 steps would normally be deleted, since they are of no use to us, but they might still make a difference in the analysis phase, so I will keep them and come back later if I don't find anything.
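Finding the days whose hourly steps sum to zero could look like the sketch below. The StepTotal column name appears in the dataset; 'ActivityHour' is an assumed name for the timestamp column:

```python
import pandas as pd

# Toy stand-in for hourlySteps_merged.csv: Id, an hourly timestamp
# ('ActivityHour' is an assumed column name) and StepTotal per hour.
df = pd.DataFrame({
    'Id': [1, 1, 1, 1],
    'ActivityHour': ['4/12/2016 1:00:00 AM', '4/12/2016 2:00:00 AM',
                     '4/13/2016 1:00:00 AM', '4/13/2016 2:00:00 AM'],
    'StepTotal': [0, 0, 120, 35],
})
df['Date'] = pd.to_datetime(df['ActivityHour']).dt.date

# Sum the hourly steps per (Id, Date) and keep the all-zero days
daily_totals = df.groupby(['Id', 'Date'])['StepTotal'].sum()
zero_days = daily_totals[daily_totals == 0]
print(zero_days)
```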
From the archive: minuteMETsNarrow_merged.csv
After some research and "some" help from ChatGPT (which gave me wrong information), there is no activity that we humans perform that costs less than 0.95 METs.
From the Compendium of Physical Activities:
sleep has a value of 0.95 METs. So, every 0-MET value in the dataset will be deleted.
For anyone who wants to know what a MET is: MET is the acronym for Metabolic Equivalent of Task. It is a unit that measures how much energy an activity consumes compared to being at rest.
I got this information from: https://www.omnicalculator.com/sports/met-minutes-per-week#met-definition
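For reference, a commonly used approximation converts METs into energy expenditure as kcal per minute = MET * 3.5 * body weight (kg) / 200. This is illustrative only; Fitbit computes calories with its own proprietary model, so these numbers will not match the dataset exactly:

```python
# Commonly used approximation for energy expenditure from METs:
#   kcal per minute = MET * 3.5 * body weight in kg / 200
# Illustrative only: Fitbit uses its own proprietary calorie model.
def kcal_per_minute(met: float, weight_kg: float) -> float:
    return met * 3.5 * weight_kg / 200

# Example: an activity of 10 METs for a 70 kg person
print(round(kcal_per_minute(10, 70), 2))  # 12.25 kcal/min
```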
From the archive: weightLogInfo_merged.csv
We must remove the column Fat, since it has only 2 values.
# Analyzing dailyIntensities_merged.csv
dataset = 'Data_Coursera_CaseStudy02/dailyIntensities_merged.csv'
df_analysis = pd.read_csv(dataset)
df_analysis['Sum'] = df_analysis['SedentaryMinutes'] + df_analysis['LightlyActiveMinutes'] + df_analysis['FairlyActiveMinutes'] + df_analysis['VeryActiveMinutes']
len(df_analysis[df_analysis['Sum']!=1440])
462
# Analyzing dailyActivity_merged.csv
dataset = 'Data_Coursera_CaseStudy02/dailyActivity_merged.csv'
df_analysis = pd.read_csv(dataset)
# Drop the days with no recorded steps
df_analysis = df_analysis[df_analysis['TotalSteps'] != 0]
# index=False avoids writing an extra 'Unnamed: 0' column on save
df_analysis.to_csv(dataset, index=False)
# Analyzing dailySteps_merged.csv
dataset = 'Data_Coursera_CaseStudy02/dailySteps_merged.csv'
df_analysis = pd.read_csv(dataset)
df_analysis
df_analysis = df_analysis[df_analysis['StepTotal']!=0]
df_analysis.to_csv( 'Data_Coursera_CaseStudy02/dailySteps_merged.csv')
# Analyzing minuteMETsNarrow_merged.csv
dataset = 'Data_Coursera_CaseStudy02/minuteMETsNarrow_merged.csv'
df_analysis = pd.read_csv(dataset)
# df_analysis
df_analysis = df_analysis[df_analysis['METs']!=0]
df_analysis.to_csv( 'Data_Coursera_CaseStudy02/minuteMETsNarrow_merged.csv')
# Removing the column 'Fat' from 'weightLogInfo_merged.csv' :
dataset = 'Data_Coursera_CaseStudy02/weightLogInfo_merged.csv'
df_drop_column = pd.read_csv(dataset)
df_drop_column.drop(columns=['Fat'],inplace=True)
df_drop_column.to_csv(dataset)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[12], line 4
      2 dataset = 'Data_Coursera_CaseStudy02/weightLogInfo_merged.csv'
      3 df_drop_column = pd.read_csv(dataset)
----> 4 df_drop_column.drop(columns=['Fat'],inplace=True)
KeyError: "['Fat'] not found in axis"
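The error happened because the cell was re-run after the Fat column had already been dropped and the file saved. Passing errors='ignore' makes the drop safe to execute more than once; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Id': [1], 'WeightKg': [70.0]})  # 'Fat' already gone

# errors='ignore' skips labels that are absent instead of raising KeyError,
# so a cleaning cell like the one above can be executed repeatedly.
df = df.drop(columns=['Fat'], errors='ignore')
```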
OK. Now we need to develop our hypotheses about what this data could show us.
dailySteps = pd.read_csv("Data_Coursera_CaseStudy02/dailySteps_merged.csv", index_col=[0])
print(f"Average number of steps taken by day: {np.average(dailySteps['StepTotal'])} steps")
Average number of steps taken by day: 8319.39281575898 steps
According to research, to be considered active a person should take 10,000 steps a day. Of course, we all have to balance an active life with our jobs, but given this average across all users, we can see that most of them are not active people.
caloriesDay = pd.read_csv("Data_Coursera_CaseStudy02/dailyCalories_merged.csv", index_col=[0])
print(f"Average Calories by day: {np.average(caloriesDay['Calories'])} calories")
Average Calories by day: 2303.609574468085 calories
This average of calories burned per day is in line with what research shows both adult men and women must burn daily to maintain their weight.
dailyActivity = pd.read_csv("Data_Coursera_CaseStudy02/dailyActivity_merged.csv", index_col=[0])
print(f"Average Time by day in lightly activities: {np.average(dailyActivity['LightlyActiveMinutes']):.2f} minutes")
print(f"Average Time by day in fairly activities: {np.average(dailyActivity['FairlyActiveMinutes']):.2f} minutes")
print(f"Average Time by day in very active activities: {np.average(dailyActivity['VeryActiveMinutes']):.2f} minutes")
Average Time by day in lightly activities: 210.02 minutes
Average Time by day in fairly activities: 14.78 minutes
Average Time by day in very active activities: 23.02 minutes
From research, with some help from Bing AI:
According to the World Health Organization (WHO), adults should spend at least 180 minutes in a variety of types of physical activities at any intensity, of which at least 60 minutes is moderate- to vigorous-intensity physical activity, spread throughout the day; more is better. The Centers for Disease Control and Prevention (CDC) recommends that adults need 150 minutes of moderate-intensity physical activity and 2 days of muscle strengthening activity each week.
So we can see the average is about 210 minutes in light activities, which may well be related to work. Most of these users need to set aside more time for fairly active and very active activities.
# Bar chart: steps through the day
hourlySteps = pd.read_csv("Data_Coursera_CaseStudy02/hourlySteps_merged.csv", index_col=[0])
hourlySteps['ActivityHour'] = pd.to_datetime(hourlySteps['ActivityHour'])
hourlySteps['Day'] = hourlySteps['ActivityHour'].dt.date
hourlySteps['Time'] = hourlySteps['ActivityHour'].dt.time
stepsMean_By_hour = []
for hour in (hourlySteps['Time'].unique()):
    stepsMean_By_hour.append(np.average(hourlySteps[hourlySteps['Time']==hour]['StepTotal']))
data = {'hours': hourlySteps['Time'].unique(),
'Average steps': stepsMean_By_hour}
fig = px.bar(data, x='hours', y='Average steps', title = 'Average steps by hour')
fig.update_layout(xaxis=dict(
title = 'Hours',
),
yaxis=dict(
title='Average steps',
side='left'
)
)
fig.show()
Here we can see something expected: most steps occur during working hours and just after, perhaps commuting home or going to the gym?
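The per-hour averaging loop above can be written more compactly with groupby; a sketch on synthetic data, assuming the same column names:

```python
import pandas as pd

# Synthetic stand-in for the hourlySteps frame after the 'Time' column is added
hourly = pd.DataFrame({
    'Time': ['01:00', '02:00', '01:00', '02:00'],
    'StepTotal': [100, 300, 200, 500],
})

# One mean per distinct hour, replacing the explicit Python loop;
# the result is a Series indexed by hour, ready to plot.
steps_mean_by_hour = hourly.groupby('Time')['StepTotal'].mean()
```

Note that groupby sorts the hours, which also guarantees a chronological x-axis.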
Users averaging more than 10,000 steps per day were classified as active.
dailySteps = pd.read_csv("Data_Coursera_CaseStudy02/dailySteps_merged.csv", index_col=[0])
dailySteps = dailySteps.groupby(by=['Id']).mean(numeric_only=True)
status_list = []
for id in dailySteps.index:
    if(dailySteps.loc[id]['StepTotal']>=10000):
        status_list.append('Active')
    else:
        status_list.append('Not Active')
dailySteps['Status'] = status_list
data = {'Status': ['Active', 'Not Active'],
'number of status': [ len(dailySteps[dailySteps['Status']=='Active']), len(dailySteps[dailySteps['Status']=='Not Active'])]}
fig = px.pie(data, values='number of status', names='Status', title='Percentage of active users based on 10.000 steps daily')
fig.show()
# print(f"Average number of steps taken by day: {np.average(dailySteps['StepTotal'])} steps")
As evidenced by the average daily steps calculation above: when we separate the calculation by user, we can see that almost 80% of them do not reach 10,000 steps per day.
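The Active/Not Active labeling loop can also be vectorized with numpy.where; a sketch using the same 10,000-step threshold on synthetic per-user averages:

```python
import numpy as np
import pandas as pd

# Synthetic per-user mean steps, as produced by the groupby above
per_user = pd.DataFrame({'StepTotal': [12000.0, 4000.0, 9800.0]}, index=[1, 2, 3])

# Single vectorized pass instead of iterating over the index
per_user['Status'] = np.where(per_user['StepTotal'] >= 10000, 'Active', 'Not Active')

# Share of users below the threshold (mean of a boolean mask)
share_not_active = (per_user['Status'] == 'Not Active').mean()
```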
# Bar chart: intensities through the day
hourlyIntensities = pd.read_csv("Data_Coursera_CaseStudy02/hourlyIntensities_merged.csv", index_col=[0])
hourlyIntensities['ActivityHour'] = pd.to_datetime(hourlyIntensities['ActivityHour'])
hourlyIntensities['Day'] = hourlyIntensities['ActivityHour'].dt.date
hourlyIntensities['Time'] = hourlyIntensities['ActivityHour'].dt.time
stepsMean_By_hour = []
for hour in (hourlyIntensities['Time'].unique()):
    stepsMean_By_hour.append(np.average(hourlyIntensities[hourlyIntensities['Time']==hour]['TotalIntensity']))
data = {'hours': hourlyIntensities['Time'].unique(),
'Average Intensities': stepsMean_By_hour}
fig = px.bar(data, x='hours', y='Average Intensities', title = 'Average Intensities by hour')
fig.update_layout(xaxis=dict(
title = 'Hours',
),
yaxis=dict(
title='Average Intensities',
side='left'
)
)
fig.show()
This follows the steps-by-hour pattern, so the same analysis applies.
# Bar chart: Average time by type of activity
hourlyCalories = pd.read_csv("Data_Coursera_CaseStudy02/hourlyCalories_merged.csv", index_col=[0])
hourlyCalories['ActivityHour'] = pd.to_datetime(hourlyCalories['ActivityHour'])
hourlyCalories['Day'] = hourlyCalories['ActivityHour'].dt.date
hourlyCalories['Time'] = hourlyCalories['ActivityHour'].dt.time
caloriesAVG_By_hour = []
for hour in (hourlyCalories['Time'].unique()):
    caloriesAVG_By_hour.append(np.average(hourlyCalories[hourlyCalories['Time']==hour]['Calories']))
data = {'hours': hourlyCalories['Time'].unique(),
'Average calories': caloriesAVG_By_hour}
fig = px.bar(data, x='hours', y='Average calories', title = 'Average calories by hour')
fig.update_layout(xaxis=dict(
title = 'Hours',
),
yaxis=dict(
title='Average calories',
side='left'
)
)
fig.show()
The same analysis applies here as well: most calories are burned during working hours.
dailySteps = pd.read_csv("Data_Coursera_CaseStudy02/dailySteps_merged.csv", index_col=[0])
sleepDay = pd.read_csv("Data_Coursera_CaseStudy02/sleepDay_merged.csv", index_col=[0])
# Remove the time data from sleepDay Dataframe
sleepDay['SleepDay'] = sleepDay['SleepDay'].str.replace(' 12:00:00 AM','')
# Renaming the columns
sleepDay.rename(columns={'SleepDay': 'Day'}, inplace=True)
dailySteps.rename(columns={'ActivityDay': 'Day'}, inplace=True)
# New dataframe
df_analysis_sleep_steps = pd.merge(dailySteps, sleepDay, on = ['Id', 'Day'])
# df_analysis_sleep_steps
# Plot
fig = px.scatter(x=df_analysis_sleep_steps['TotalMinutesAsleep'], y=df_analysis_sleep_steps['StepTotal'], title="Minutes Asleep vs Steps total")
fig.update_layout(legend=dict(
orientation="h",
yanchor="bottom",
y=1.02,
xanchor="right",
x=1,
title=''
),xaxis=dict(
title = 'Minutes Asleep',
),
yaxis=dict(
title='Total Steps in a day',
side='left',
))
fig.show()
In this plot we can see that most users cluster around ~440 minutes of sleep and up to ~10k steps per day.
There does not appear to be a relationship between the two variables.
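The visual impression of "no relationship" can be checked numerically with a Pearson correlation coefficient; a sketch on synthetic values (with the real data, the two merged columns would be passed instead):

```python
import numpy as np

# Hypothetical values standing in for TotalMinutesAsleep and StepTotal
minutes_asleep = np.array([400, 420, 440, 460, 380])
step_total = np.array([9000, 4000, 11000, 6000, 10000])

# np.corrcoef returns the 2x2 correlation matrix; element [0, 1] is r.
# |r| close to 0 supports "no clear linear relationship".
r = np.corrcoef(minutes_asleep, step_total)[0, 1]
```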
sleepDay = pd.read_csv("Data_Coursera_CaseStudy02/sleepDay_merged.csv", index_col=[0])
# Remove the time data from sleepDay Dataframe
sleepDay['SleepDay'] = sleepDay['SleepDay'].str.replace(' 12:00:00 AM','')
sleepDay
fig = px.scatter(x=sleepDay['TotalMinutesAsleep'], y=sleepDay['TotalTimeInBed'], title="Minutes Asleep vs Total time in Bed")
fig.update_layout(legend=dict(
orientation="h",
yanchor="bottom",
y=1.02,
xanchor="right",
x=1,
title=''
),xaxis=dict(
title = 'Minutes Asleep',
),
yaxis=dict(
title='Total Time in Bed',
side='left',
))
fig.show()
An almost exactly linear relationship, though a fairly obvious one.
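Since time in bed nearly determines time asleep, their ratio (sleep efficiency) may be a more informative derived metric; a sketch using the two sleepDay columns on synthetic rows:

```python
import pandas as pd

# Synthetic rows shaped like sleepDay_merged.csv
sleep = pd.DataFrame({
    'TotalMinutesAsleep': [327, 384, 412],
    'TotalTimeInBed': [346, 407, 442],
})

# Fraction of time in bed actually spent asleep; values near 1 mean
# little time lying awake, lower values may flag restless nights.
sleep['Efficiency'] = sleep['TotalMinutesAsleep'] / sleep['TotalTimeInBed']
```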
# Plot
fig = px.scatter(x=df_analysis_sleep_steps['TotalSleepRecords'], y=df_analysis_sleep_steps['StepTotal'], title="Total Sleep Records vs Steps total")
fig.update_layout(legend=dict(
orientation="h",
yanchor="bottom",
y=1.02,
xanchor="right",
x=1,
title=''
),xaxis=dict(
title = 'Total Sleep Records',
),
yaxis=dict(
title='Total Steps in a day',
side='left',
))
fig.show()
Again, nothing we can confirm.
If we had more data on people with 3 sleep records, we could test whether those who don't appear to sleep well tend to take fewer steps per day.
dailySteps = pd.read_csv("Data_Coursera_CaseStudy02/dailySteps_merged.csv", index_col=[0])
caloriesDay = pd.read_csv("Data_Coursera_CaseStudy02/dailyCalories_merged.csv", index_col=[0])
# Remove the time data from sleepDay Dataframe
# sleepDay['SleepDay'] = sleepDay['SleepDay'].str.replace(' 12:00:00 AM','')
# Renaming the columns
caloriesDay.rename(columns={'ActivityDay': 'Day'}, inplace=True)
dailySteps.rename(columns={'ActivityDay': 'Day'}, inplace=True)
# New dataframe
df_analysis_calories_steps = pd.merge(dailySteps, caloriesDay, on = ['Id', 'Day'])
# df_analysis_sleep_steps
# Plot
fig = px.scatter(x=df_analysis_calories_steps['Calories'], y=df_analysis_calories_steps['StepTotal'], title="Calories vs Steps total")
fig.update_layout(legend=dict(
orientation="h",
yanchor="bottom",
y=1.02,
xanchor="right",
x=1,
title=''
),xaxis=dict(
title = 'Calories spent in a Day',
),
yaxis=dict(
title='Total Steps in a day',
side='left',
))
fig.show()
A good correlation: the more steps you take in a day, the more calories you burn.
We have some outliers, though.
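The strength of the steps-calories relationship could be quantified with a least-squares line; a sketch with np.polyfit on synthetic points lying on a known line:

```python
import numpy as np

# Hypothetical steps/calories pairs on an exact line, for illustration
steps = np.array([2000.0, 6000.0, 10000.0, 14000.0])
calories = 1500.0 + 0.1 * steps

# Degree-1 least-squares fit: slope ≈ extra kcal per additional step,
# intercept ≈ baseline (resting) expenditure.
slope, intercept = np.polyfit(steps, calories, deg=1)
```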
dailyActivity = pd.read_csv("Data_Coursera_CaseStudy02/dailyActivity_merged.csv", index_col=[0])
sleepDay = pd.read_csv("Data_Coursera_CaseStudy02/sleepDay_merged.csv", index_col=[0])
# Remove the time data from sleepDay Dataframe
sleepDay['SleepDay'] = sleepDay['SleepDay'].str.replace(' 12:00:00 AM','')
# Renaming the columns
dailyActivity.rename(columns={'ActivityDate': 'Day'}, inplace=True)
sleepDay.rename(columns={'SleepDay': 'Day'}, inplace=True)
# Merging the Dataframes
df_analysis_calories_steps = pd.merge(dailyActivity, sleepDay, on = ['Id', 'Day'])
df_analysis_calories_steps
| | Unnamed: 0.1 | Unnamed: 0 | Id | Day | TotalSteps | TotalDistance | TrackerDistance | LoggedActivitiesDistance | VeryActiveDistance | ModeratelyActiveDistance | LightActiveDistance | SedentaryActiveDistance | VeryActiveMinutes | FairlyActiveMinutes | LightlyActiveMinutes | SedentaryMinutes | Calories | TotalSleepRecords | TotalMinutesAsleep | TotalTimeInBed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1503960366 | 4/12/2016 | 13162 | 8.50 | 8.50 | 0.0 | 1.88 | 0.55 | 6.06 | 0.0 | 25 | 13 | 328 | 728 | 1985 | 1 | 327 | 346 |
| 1 | 1 | 1 | 1503960366 | 4/13/2016 | 10735 | 6.97 | 6.97 | 0.0 | 1.57 | 0.69 | 4.71 | 0.0 | 21 | 19 | 217 | 776 | 1797 | 2 | 384 | 407 |
| 2 | 3 | 3 | 1503960366 | 4/15/2016 | 9762 | 6.28 | 6.28 | 0.0 | 2.14 | 1.26 | 2.83 | 0.0 | 29 | 34 | 209 | 726 | 1745 | 1 | 412 | 442 |
| 3 | 4 | 4 | 1503960366 | 4/16/2016 | 12669 | 8.16 | 8.16 | 0.0 | 2.71 | 0.41 | 5.04 | 0.0 | 36 | 10 | 221 | 773 | 1863 | 2 | 340 | 367 |
| 4 | 5 | 5 | 1503960366 | 4/17/2016 | 9705 | 6.48 | 6.48 | 0.0 | 3.19 | 0.78 | 2.51 | 0.0 | 38 | 20 | 164 | 539 | 1728 | 1 | 700 | 712 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 405 | 827 | 898 | 8792009665 | 4/30/2016 | 7174 | 4.59 | 4.59 | 0.0 | 0.33 | 0.36 | 3.91 | 0.0 | 10 | 20 | 301 | 749 | 2896 | 1 | 343 | 360 |
| 406 | 828 | 899 | 8792009665 | 5/1/2016 | 1619 | 1.04 | 1.04 | 0.0 | 0.00 | 0.00 | 1.04 | 0.0 | 0 | 0 | 79 | 834 | 1962 | 1 | 503 | 527 |
| 407 | 829 | 900 | 8792009665 | 5/2/2016 | 1831 | 1.17 | 1.17 | 0.0 | 0.00 | 0.00 | 1.17 | 0.0 | 0 | 0 | 101 | 916 | 2015 | 1 | 415 | 423 |
| 408 | 830 | 901 | 8792009665 | 5/3/2016 | 2421 | 1.55 | 1.55 | 0.0 | 0.00 | 0.00 | 1.55 | 0.0 | 0 | 0 | 156 | 739 | 2297 | 1 | 516 | 545 |
| 409 | 831 | 902 | 8792009665 | 5/4/2016 | 2283 | 1.46 | 1.46 | 0.0 | 0.00 | 0.00 | 1.46 | 0.0 | 0 | 0 | 129 | 848 | 2067 | 1 | 439 | 463 |
410 rows × 20 columns
# Figure Subplots
fig = make_subplots(rows=1, cols=3 ,shared_yaxes=True)
fig.add_trace(
go.Scatter(x=df_analysis_calories_steps['LightlyActiveMinutes'], y=df_analysis_calories_steps['TotalMinutesAsleep'], mode='markers',name='Lightly Activities'),
row=1, col=1,
)
fig.add_trace(
go.Scatter(x=df_analysis_calories_steps['FairlyActiveMinutes'], y=df_analysis_calories_steps['TotalMinutesAsleep'], mode='markers',name='Fairly Active'),
row=1, col=2
)
fig.add_trace(
go.Scatter(x=df_analysis_calories_steps['VeryActiveMinutes'], y=df_analysis_calories_steps['TotalMinutesAsleep'], mode='markers',name='Very Active'),
row=1, col=3
)
fig.update_layout(height=600, width=900, title_text="Activity Time vs Sleep Time ", hovermode="x unified")
fig.update_layout(legend=dict(
orientation="h",
yanchor="bottom",
y=1.02,
xanchor="right",
x=1,
title=''
),xaxis=dict(
title = 'Time spent in lightly activities',
),
yaxis=dict(
title='Time Slept',
side='left',
))
fig.update_xaxes(row=1, col=2, title = 'Time spent in fairly activities')
fig.update_xaxes(row=1, col=3, title = 'Time spent in very active activities')
# fig.update_traces(connectgaps=False)
fig.show()
Here we can see a preference for light activities over fairly active or very active ones.
Again, most of these people sleep ~440 minutes a night (7 hours and 20 minutes).
But we also have many points below 300 minutes (5 hours), and that is not healthy.
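The share of unhealthy short-sleep records could be quantified directly; a sketch on synthetic TotalMinutesAsleep values:

```python
import pandas as pd

# Hypothetical nightly sleep durations (minutes)
asleep = pd.Series([327, 280, 412, 250, 440, 700])

# Share of records under 5 hours (300 minutes) of sleep
short_share = (asleep < 300).mean()
```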
# Density plots from the plot above:
# Figure Subplots
fig = make_subplots(rows=1, cols=3 ,shared_yaxes=True)
fig.add_trace(
go.Histogram2dContour(x=df_analysis_calories_steps['LightlyActiveMinutes'], y=df_analysis_calories_steps['TotalMinutesAsleep'], colorscale = 'Blues',name='Lightly Activities'),
row=1, col=1,
)
fig.add_trace(
go.Histogram2dContour(x=df_analysis_calories_steps['FairlyActiveMinutes'], y=df_analysis_calories_steps['TotalMinutesAsleep'], colorscale = 'Blues',name='Fairly Active'),
row=1, col=2
)
fig.add_trace(
go.Histogram2dContour(x=df_analysis_calories_steps['VeryActiveMinutes'], y=df_analysis_calories_steps['TotalMinutesAsleep'], colorscale = 'Blues',name='Very Active'),
row=1, col=3
)
fig.update_layout(height=800, width=900, title_text="Activity Time vs Sleep Time ", hovermode="x unified")
fig.update_layout(legend=dict(
orientation="h",
yanchor="bottom",
y=1.02,
xanchor="right",
x=1,
title=''
),xaxis=dict(
title = 'Time spent in lightly activities',
range = [0,450]
),
yaxis=dict(
title='Time Slept',
side='left',
))
fig.update_xaxes(row=1, col=2, title = 'Time spent in fairly activities', range = [-5.5,55])
fig.update_xaxes(row=1, col=3, title = 'Time spent in very active activities', range = [-10,85])
# fig.update_traces(connectgaps=False)
Same as above, but as a density plot. We can see that the preference for light activities holds.
# Figure Subplots
fig = make_subplots(rows=1, cols=3 ,shared_yaxes=True)
fig.add_trace(
go.Scatter(x=df_analysis_calories_steps['LightActiveDistance'], y=df_analysis_calories_steps['TotalMinutesAsleep'], mode='markers',name='Lightly Activities Distance'),
row=1, col=1,
)
fig.add_trace(
go.Scatter(x=df_analysis_calories_steps['ModeratelyActiveDistance'], y=df_analysis_calories_steps['TotalMinutesAsleep'], mode='markers',name='Moderately Active Distance'),
row=1, col=2
)
fig.add_trace(
go.Scatter(x=df_analysis_calories_steps['VeryActiveDistance'], y=df_analysis_calories_steps['TotalMinutesAsleep'], mode='markers',name='Very Active Distance'),
row=1, col=3
)
fig.update_layout(height=600, width=900, title_text="Distance in activities vs Sleep Time ", hovermode="x unified")
fig.update_layout(legend=dict(
orientation="h",
yanchor="bottom",
y=1.02,
xanchor="right",
x=1,
title=''
),xaxis=dict(
title = 'Distance in lightly activities',
),
yaxis=dict(
title='Time Slept',
side='left',
))
fig.update_xaxes(row=1, col=2, title = 'Distance in moderately activities')
fig.update_xaxes(row=1, col=3, title = 'Distance in very active activities')
# fig.update_traces(connectgaps=False)
fig.show()
Same as Time in activities vs Sleep Time.
dailyActivity = pd.read_csv("Data_Coursera_CaseStudy02/dailyActivity_merged.csv", index_col=[0])
caloriesDay = pd.read_csv("Data_Coursera_CaseStudy02/dailyCalories_merged.csv", index_col=[0])
# Remove the time data from sleepDay Dataframe
# sleepDay['SleepDay'] = sleepDay['SleepDay'].str.replace(' 12:00:00 AM','')
# Renaming the columns
caloriesDay.rename(columns={'ActivityDay': 'Day'}, inplace=True)
# Renaming the columns
dailyActivity.rename(columns={'ActivityDate': 'Day'}, inplace=True)
# Merging the Dataframes
df_analysis_calories_activities = pd.merge(dailyActivity, caloriesDay, on = ['Id', 'Day'])
df_analysis_calories_activities
| | Unnamed: 0.1 | Unnamed: 0 | Id | Day | TotalSteps | TotalDistance | TrackerDistance | LoggedActivitiesDistance | VeryActiveDistance | ModeratelyActiveDistance | LightActiveDistance | SedentaryActiveDistance | VeryActiveMinutes | FairlyActiveMinutes | LightlyActiveMinutes | SedentaryMinutes | Calories_x | Calories_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1503960366 | 4/12/2016 | 13162 | 8.500000 | 8.500000 | 0.0 | 1.88 | 0.55 | 6.06 | 0.00 | 25 | 13 | 328 | 728 | 1985 | 1985 |
| 1 | 1 | 1 | 1503960366 | 4/13/2016 | 10735 | 6.970000 | 6.970000 | 0.0 | 1.57 | 0.69 | 4.71 | 0.00 | 21 | 19 | 217 | 776 | 1797 | 1797 |
| 2 | 2 | 2 | 1503960366 | 4/14/2016 | 10460 | 6.740000 | 6.740000 | 0.0 | 2.44 | 0.40 | 3.91 | 0.00 | 30 | 11 | 181 | 1218 | 1776 | 1776 |
| 3 | 3 | 3 | 1503960366 | 4/15/2016 | 9762 | 6.280000 | 6.280000 | 0.0 | 2.14 | 1.26 | 2.83 | 0.00 | 29 | 34 | 209 | 726 | 1745 | 1745 |
| 4 | 4 | 4 | 1503960366 | 4/16/2016 | 12669 | 8.160000 | 8.160000 | 0.0 | 2.71 | 0.41 | 5.04 | 0.00 | 36 | 10 | 221 | 773 | 1863 | 1863 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 858 | 858 | 935 | 8877689391 | 5/8/2016 | 10686 | 8.110000 | 8.110000 | 0.0 | 1.08 | 0.20 | 6.80 | 0.00 | 17 | 4 | 245 | 1174 | 2847 | 2847 |
| 859 | 859 | 936 | 8877689391 | 5/9/2016 | 20226 | 18.250000 | 18.250000 | 0.0 | 11.10 | 0.80 | 6.24 | 0.05 | 73 | 19 | 217 | 1131 | 3710 | 3710 |
| 860 | 860 | 937 | 8877689391 | 5/10/2016 | 10733 | 8.150000 | 8.150000 | 0.0 | 1.35 | 0.46 | 6.28 | 0.00 | 18 | 11 | 224 | 1187 | 2832 | 2832 |
| 861 | 861 | 938 | 8877689391 | 5/11/2016 | 21420 | 19.559999 | 19.559999 | 0.0 | 13.22 | 0.41 | 5.89 | 0.00 | 88 | 12 | 213 | 1127 | 3832 | 3832 |
| 862 | 862 | 939 | 8877689391 | 5/12/2016 | 8064 | 6.120000 | 6.120000 | 0.0 | 1.82 | 0.04 | 4.25 | 0.00 | 23 | 1 | 137 | 770 | 1849 | 1849 |
863 rows × 18 columns
# Figure Subplots
fig = make_subplots(rows=1, cols=3 ,shared_yaxes=True)
fig.add_trace(
go.Scatter(x=dailyActivity['LightlyActiveMinutes'], y=dailyActivity['Calories'], mode='markers',name='Lightly Activities'),
row=1, col=1,
)
fig.add_trace(
go.Scatter(x=dailyActivity['FairlyActiveMinutes'], y=dailyActivity['Calories'], mode='markers',name='Fairly Active'),
row=1, col=2
)
fig.add_trace(
go.Scatter(x=dailyActivity['VeryActiveMinutes'], y=dailyActivity['Calories'], mode='markers',name='Very Active'),
row=1, col=3
)
fig.update_layout(height=600, width=900, title_text="Activity Time vs Calories", hovermode="x unified")
fig.update_layout(legend=dict(
orientation="h",
yanchor="bottom",
y=1.02,
xanchor="right",
x=1,
title=''
),xaxis=dict(
title = 'Time spent in lightly activities',
),
yaxis=dict(
title='Calories',
side='left',
))
fig.update_xaxes(row=1, col=2, title = 'Time spent in fairly activities')
fig.update_xaxes(row=1, col=3, title = 'Time spent in very active activities')
# fig.update_traces(connectgaps=False)
fig.show()
# Bar chart: Average time by type of activity
data = {'labels': ['Lightly', 'Fairly', 'Very active'],
'Mean time in activities': [np.average(dailyActivity['LightlyActiveMinutes']), np.average(dailyActivity['FairlyActiveMinutes']), np.average(dailyActivity['VeryActiveMinutes'])]}
fig = px.bar(data, x='labels', y='Mean time in activities', title = 'Average time by type of activity')
fig.update_layout(xaxis=dict(
title = 'Type of Activity',
),
yaxis=dict(
title='Mean time (minutes)',
side='left'
)
)
fig.show()
dailySteps = pd.read_csv("Data_Coursera_CaseStudy02/dailySteps_merged.csv", index_col=[0])
weightLogInfo = pd.read_csv("Data_Coursera_CaseStudy02/weightLogInfo_merged.csv", index_col=[0])
dailySteps = dailySteps.groupby(by=['Id']).mean(numeric_only=True)
dailySteps.reset_index(inplace=True)
df_analysis_steps_BMI = pd.merge(dailySteps, weightLogInfo, on = ['Id'])
df_analysis_steps_BMI
fig = px.scatter(x=df_analysis_steps_BMI['BMI'], y=df_analysis_steps_BMI['StepTotal'], title="BMI vs Average Steps total")
fig.update_layout(legend=dict(
orientation="h",
yanchor="bottom",
y=1.02,
xanchor="right",
x=1,
title=''
),xaxis=dict(
title = 'BMI (Body Mass Index)',
),
yaxis=dict(
title='Total Steps in a day',
side='left',
))
fig.show()
OK. We do not have much data, but we can presume that people with a BMI higher than 40 really do have difficulty doing more exercise.
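For reference, BMI follows the standard definition weight / height²; a minimal sketch (height is not in weightLogInfo_merged.csv, so it is a hypothetical input here):

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body Mass Index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2

# Hypothetical person: 70 kg, 1.75 m; BMI >= 40 is the severe-obesity range
value = bmi(70.0, 1.75)  # ≈ 22.86
```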
It is important to note that the data analyzed in this study comes from a small sample of Fitbit users, not from Bellabeat's own clients, and therefore cannot be generalized to the entire population. A sample of 7~33 people is not large enough to assume that the observed behaviors are representative of the broader population. Any conclusions or insights drawn from this data should therefore be treated with caution and should not be applied to the entire Bellabeat client base without further research and analysis; a significantly larger sample is needed to accurately reflect the population's behavior and support informed decisions.
Most of these users cannot be considered active, taking fewer than 10,000 steps per day;
Some users do not sleep well;
Most users do not have a habit of exercising;
The most active time is from 17h to 20h;
People with a high BMI tend to exercise less;
Based on the trends and hypotheses found, I recommend creating new features for the integration of the Bellabeat Leaf with the Bellabeat app and/or the Bellabeat membership:
New sleeping traking system:
Game mode of steps daily task and activity time (with rewards):
Social media share options on app:
In conclusion, I believe that more research and analysis is needed to better understand the specific needs and behaviors of Bellabeat users in order to continue to improve and tailor the product to meet their needs.